fix(cli): reject execute() immediately when child process is dead #2978
When a child process crashes and a retry is attempted on the same TaskRunProcess, execute() would hang forever because the IPC send was silently skipped and the attempt promise could never resolve. This caused runner pods to stay up indefinitely with no heartbeats.
🦋 Changeset detected

Latest commit: f3049f6

The changes in this PR will be included in the next version bump. This PR includes changesets to release 28 packages.
Pull request overview
This PR fixes a critical bug where runner pods would hang indefinitely when attempting to retry a task execution on a crashed child process. The execute() method would silently skip the IPC send to the dead process but never resolve or reject its attempt promise, causing the runner to stop processing work without exiting.
Changes:
- Modified `TaskRunProcess.execute()` to immediately reject the attempt promise when the child process is not connected
- Added comprehensive test coverage to verify the fix prevents hanging behavior
- Added changeset documenting the patch
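The guard described above can be illustrated with a minimal sketch. This is not the actual `TaskRunProcess` code: `SketchRunProcess`, `ChildLike`, `handleCompletion`, the message shape, and the error text are all hypothetical names chosen for illustration. The only point it shows is the added else-style branch: when the child is missing or disconnected, the attempt promise is rejected right away instead of being left pending.

```typescript
// Hypothetical stand-in for a forked child process handle (assumption, not the real type).
type ChildLike = { connected: boolean; send(message: unknown): void };

type Pending = { resolve: (value: string) => void; reject: (error: Error) => void };

// Hypothetical, heavily simplified stand-in for TaskRunProcess.
class SketchRunProcess {
  private readonly child: ChildLike | undefined;
  private pending: Pending | undefined;

  constructor(child: ChildLike | undefined) {
    this.child = child;
  }

  execute(payload: unknown): Promise<string> {
    return new Promise<string>((resolve, reject) => {
      if (!this.child || !this.child.connected) {
        // Dead or disconnected child: no IPC reply can ever arrive,
        // so fail the attempt now rather than leaving the promise pending.
        reject(new Error("Child process is not connected"));
        return;
      }
      // Live child: remember the resolvers and send the work over IPC.
      this.pending = { resolve, reject };
      this.child.send({ type: "EXECUTE", payload });
    });
  }

  // Called when the child reports that the attempt finished.
  handleCompletion(result: string): void {
    this.pending?.resolve(result);
    this.pending = undefined;
  }
}
```

Without the early `reject`, the first branch would simply be skipped for a dead child and the returned promise would never settle, which matches the hang described in this PR.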
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/cli-v3/src/executions/taskRunProcess.ts | Added else branch to reject attempt promise when child process is disconnected |
| packages/cli-v3/src/executions/taskRunProcess.test.ts | New test file verifying execute() rejects promptly instead of hanging on dead processes |
| .changeset/fix-dead-process-execute-hang.md | Changeset documenting the bug fix |
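One way to check "rejects promptly instead of hanging" is to race the call against a timer. The sketch below is an assumption about how such a check could be written, not a copy of `taskRunProcess.test.ts`; the helper name `withTimeout` and its error message are hypothetical.

```typescript
// Hypothetical helper: settles with the promise's own outcome, or rejects
// with a timeout error if the promise has not settled within `ms` milliseconds.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins the race.
    return await Promise.race([promise, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

In a test, a hanging `execute()` would surface as the timeout error, while the fixed behavior would surface the rejection raised for the dead child well before the deadline.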
…iggerdotdev#2978)

## Summary

- When a child process crashes and a retry (`RETRY_IMMEDIATELY`) is attempted on the same `TaskRunProcess`, `execute()` hangs forever because the IPC send is silently skipped and the attempt promise can never resolve
- This caused runner pods to stay up indefinitely with no heartbeats or polls
- Fix: reject the attempt promise immediately when the child is not connected, so the controller can proceed to warm start or exit

## Test plan

- [x] Added `taskRunProcess.test.ts`: verifies `execute()` rejects promptly instead of hanging when the child process is dead
- [x] Deploy and verify no more stuck runner pods accumulate over time
This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and publish to npm yourself or [setup this action to publish automatically](https://github.com/changesets/action#with-publishing). If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

# Releases

## @trigger.dev/sdk@4.4.0

### Minor Changes

- Added `query.execute()` which lets you query your Trigger.dev data using TRQL (Trigger Query Language) and returns results as typed JSON rows or CSV. It supports configurable scope (environment, project, or organization), time filtering via `period` or `from`/`to` ranges, and a `format` option for JSON or CSV output. ([#3060](#3060))

  ```typescript
  import { query } from "@trigger.dev/sdk";
  import type { QueryTable } from "@trigger.dev/sdk";

  // Basic untyped query
  const result = await query.execute("SELECT run_id, status FROM runs LIMIT 10");

  // Type-safe query using QueryTable to pick specific columns
  const typedResult = await query.execute<QueryTable<"runs", "run_id" | "status" | "triggered_at">>(
    "SELECT run_id, status, triggered_at FROM runs LIMIT 10"
  );

  typedResult.results.forEach((row) => {
    console.log(row.run_id, row.status); // Fully typed
  });

  // Aggregation query with inline types
  const stats = await query.execute<{ status: string; count: number }>(
    "SELECT status, COUNT(*) as count FROM runs GROUP BY status",
    { scope: "project", period: "30d" }
  );

  // CSV export
  const csv = await query.execute("SELECT run_id, status FROM runs", {
    format: "csv",
    period: "7d",
  });

  console.log(csv.results); // Raw CSV string
  ```

### Patch Changes

- Add `maxDelay` option to debounce feature. This allows setting a maximum time limit for how long a debounced run can be delayed, ensuring execution happens within a specified window even with continuous triggers. ([#2984](#2984))

  ```typescript
  await myTask.trigger(payload, {
    debounce: {
      key: "my-key",
      delay: "5s",
      maxDelay: "30m", // Execute within 30 minutes regardless of continuous triggers
    },
  });
  ```

- Aligned the SDK's `getRunIdForOptions` logic with the Core package to handle semantic targets (`root`, `parent`) in root tasks. ([#2874](#2874))
- Export `AnyOnStartAttemptHookFunction` type to allow defining `onStartAttempt` hooks for individual tasks. ([#2966](#2966))
- Fixed a minor issue in the deployment command on distinguishing between local builds for the cloud vs local builds for self-hosting setups. ([#3070](#3070))
- Updated dependencies:
  - `@trigger.dev/core@4.4.0`

## @trigger.dev/build@4.4.0

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.0`

## trigger.dev@4.4.0

### Patch Changes

- Fix runner getting stuck indefinitely when `execute()` is called on a dead child process. ([#2978](#2978))
- Add optional `timeoutInSeconds` parameter to the `wait_for_run_to_complete` MCP tool. Defaults to 60 seconds. If the run doesn't complete within the timeout, the current state of the run is returned instead of waiting indefinitely. ([#3035](#3035))
- Fixed a minor issue in the deployment command on distinguishing between local builds for the cloud vs local builds for self-hosting setups. ([#3070](#3070))
- Updated dependencies:
  - `@trigger.dev/core@4.4.0`
  - `@trigger.dev/build@4.4.0`
  - `@trigger.dev/schema-to-json@4.4.0`

## @trigger.dev/core@4.4.0

### Patch Changes

- Add `maxDelay` option to debounce feature. This allows setting a maximum time limit for how long a debounced run can be delayed, ensuring execution happens within a specified window even with continuous triggers. ([#2984](#2984))

  ```typescript
  await myTask.trigger(payload, {
    debounce: {
      key: "my-key",
      delay: "5s",
      maxDelay: "30m", // Execute within 30 minutes regardless of continuous triggers
    },
  });
  ```

- Fixed a minor issue in the deployment command on distinguishing between local builds for the cloud vs local builds for self-hosting setups. ([#3070](#3070))
- fix: vendor superjson to fix ESM/CJS compatibility ([#2949](#2949))

  Bundle superjson during build to avoid `ERR_REQUIRE_ESM` errors on Node.js versions that don't support `require(ESM)` by default (< 22.12.0) and AWS Lambda which intentionally disables it.

- Add Vercel integration support to API schemas: `commitSHA` and `integrationDeployments` on deployment responses, and `source` field for environment variable imports. ([#2994](#2994))

## @trigger.dev/python@4.4.0

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.0`
  - `@trigger.dev/sdk@4.4.0`
  - `@trigger.dev/build@4.4.0`

## @trigger.dev/react-hooks@4.4.0

### Patch Changes

- Fix `onComplete` callback firing prematurely when the realtime stream disconnects before the run finishes. ([#2929](#2929))
- Updated dependencies:
  - `@trigger.dev/core@4.4.0`

## @trigger.dev/redis-worker@4.4.0

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.0`

## @trigger.dev/rsc@4.4.0

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.0`

## @trigger.dev/schema-to-json@4.4.0

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.0`

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Summary
- When a child process crashes and a retry (`RETRY_IMMEDIATELY`) is attempted on the same `TaskRunProcess`, `execute()` hangs forever because the IPC send is silently skipped and the attempt promise can never resolve

Test plan
- `taskRunProcess.test.ts`: verifies `execute()` rejects promptly instead of hanging when the child process is dead